An analysis of FRE @ BC8 SympTEMIST track: named entity recognition

Abstract
This paper is a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition Zenodo.) to the ‘SympTEMIST’ Named Entity Recognition (NER) shared subtask at ‘BioCreative 2023’. We participated in the challenge by submitting two systems based on a RoBERTa architecture LLM trained on Spanish-language clinical data, available in the ‘HuggingFace’ model repository. Before choosing the systems to submit, we tried different combinations of the techniques described here: Conditional Random Fields and Byte-Pair Encoding dropout. In the second system we also included Sub-Subword feature based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze our methods in more depth and to measure the impact of introducing data from the CARMEN-I (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo) corpus. Our experiments show the moderate effect of using the Sub-Subword feature based embeddings and the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/


Introduction
Named Entity Recognition (NER) is one of the cornerstones of text mining. It is particularly useful in the clinical context, where Electronic Health Records (EHRs) often consist of many unstructured clinical notes containing entities such as diseases, procedures, drugs, and symptoms. The case for symptoms is particularly challenging, since a single symptom can be written in many ways with varying degrees of detail. NER is necessary to go from unstructured to structured information on which downstream tasks can be performed. The performance of the downstream task directly depends on the performance of the NER task.
The SympTEMIST task at the BioCreative VIII evaluation initiative was structured into three sub-tracks for symptom detection: Named Entity Recognition, Normalization and Entity Linking, and Multilingual Normalization. For the reasons mentioned in the previous paragraph, the symptom NER subtask was of particular interest to us. The gold standard data is freely available at: https://zenodo.org/doi/10.5281/zenodo.8223653.
NER, as a classical Natural Language Processing (NLP) task, has a long history. Besides simple n-gram matching, a popular approach to NER was Hidden Markov Models [4,5]. Applying Conditional Random Fields (CRFs) [6,7] was an improvement over Hidden Markov Models. With the popularization of Deep Learning (DL), Recurrent Neural Networks (RNNs) became popular for NER [8].
NER models are trained with hand-labelled (gold standard) data. This kind of data is costly to produce and therefore usually exists in limited amounts. However, DL networks usually need substantial amounts of data to start producing satisfactory results. Because of this, large language models (LLMs), such as BERT [9] and RoBERTa [10], became popular for NER. These models are trained on unlabelled data and serve as a basis for other downstream tasks. Nowadays, LLM-based solutions are among the most popular, and so was the case for the SympTEMIST challenge, as noted by the overview paper: 'most teams used some sort of transformer-based approach' [11].
Shortly before we planned the experiments for this article, version 1.0 of the CARMEN-I dataset was released (Announcement website: https://www.bsc.es/news/bsc-news/carmen-i-digitizing-covid-19-medical-records-artificialintelligence). The CARMEN-I dataset includes a corpus of Spanish-language clinical records labelled by experts. The labels include symptoms, as well as other key medical concepts such as diseases, procedures, medications, and species. Because of this match with our target task, we decided to also run experiments measuring the impact of the CARMEN-I Spanish-language symptom annotations on our results. Our goal was to see whether mixing in this data, which was produced under different annotation criteria, would improve the performance of our solutions.
Our submitted systems are based on a RoBERTa architecture LLM trained on Spanish-language clinical data, available in the 'HuggingFace' model repository (Model name: 'PlanTL-GOB-ES/roberta-base-biomedical-clinical-es'). The techniques that we used are CRFs, BPE dropout, and, for one of the systems, Sub-Subword feature-based embeddings (SSW). All these techniques are briefly introduced in the 'Techniques' section.
The data used for the experiments in this article are available in Zenodo for the SympTEMIST dataset, at https://doi.org/10.5281/zenodo.8223654; and in PhysioNet for the CARMEN-I dataset, at https://doi.org/10.13026/bxrx-y344 (for credentialed users who sign the DUA).

Techniques
This section describes the strategies used for our NER models. The first two, 'CRF' and 'BPE dropout', were used for both submissions. One of the systems used the 'sub-subword features' technique, while the other one did not.

Conditional random fields
A CRF can model the probability of transitioning from one output label to the next. It does so by training an additional matrix (called the transition matrix) in conjunction with the model. The 'Viterbi' algorithm is usually applied at prediction time to find the best-scoring output sequence. In NER tasks following a schema such as BIO, Beginning-Inside-Outside [12], a CRF can help avoid impossible transitions. The original BERT paper [9] demonstrated the usability of BERT for NER, but it did not use a CRF. Later authors, such as [13], showed that a CRF improved the results in some cases.
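As an illustration of how a transition matrix interacts with 'Viterbi' decoding, the following is a minimal sketch in plain Python, not the implementation used in our systems. The emission scores, the label order, and the penalty value are made-up assumptions chosen only to show the effect of penalizing the O-to-I transition.

```python
LABELS = ["O", "B-SINTOMA", "I-SINTOMA"]

def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of label indices.

    emissions: one list of per-label scores for each token.
    transitions: transitions[i][j] is the score of moving from label i to label j.
    """
    n_labels = len(transitions)
    # Best score of any path ending in each label at the first token.
    scores = list(emissions[0])
    backpointers = []
    for token_scores in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Previous label that gives the best path into label j.
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + token_scores[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical scores: token 1 favours "O", token 2 slightly favours "I-SINTOMA".
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5]]
free = [[0.0] * 3 for _ in range(3)]
constrained = [row[:] for row in free]
constrained[0][2] = -10000.0  # assumed penalty for the O -> I-SINTOMA transition

unconstrained_path = [LABELS[i] for i in viterbi(emissions, free)]
constrained_path = [LABELS[i] for i in viterbi(emissions, constrained)]
```

With all transitions scored equally the decoder picks the per-token maxima, which yields the impossible sequence O, I-SINTOMA; with the penalty it falls back to O, B-SINTOMA. The sketch does not constrain the first token.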

BPE dropout
BPE dropout [14] can regularize NLP models by varying the way text is represented, resulting in an effect similar to data augmentation. It was introduced as an alternative to the subword regularization method of [15], whose main drawback, according to [14], is its complexity: it requires training a unigram language model and uses the 'Expectation-Maximization' (EM) and 'Viterbi' [16] algorithms to sample segmentations.
One of the benefits of 'BPE dropout' is that it works on 'BPE' vocabulary models [17]. BPE is the tokenization commonly used by 'RoBERTa' models, so we did not need to rebuild the vocabularies. In comparison, the 'unigram language model subword regularization' method uses a statistical model and dynamic programming to sample different segmentations from the same sequence. BPE dropout instead uses random noise to discard certain merge operations, randomly generating a different sequence of subwords each time. It works this way because BPE does not store the frequency of each subword, only the order of the merge operations. Merge operations are discarded with a probability p, which is usually 0.1. Provilkov et al. [14] concluded through several experiments that BPE dropout achieves better results. Our systems used 'BPE dropout' during training, with a dropout probability p of 0.1.
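The merge-discarding step can be sketched as follows. This is a simplified illustration under our own assumptions, not the tokenizer code used in our systems: the merge table MERGES is hypothetical, and one merge is applied per iteration.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Segment `word` using an ordered list of BPE merge operations.

    At every step each applicable merge is discarded with probability
    `dropout`; the highest-priority surviving merge is then applied.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Applicable merges among adjacent symbols that survive the dropout.
        candidates = [
            (ranks[pair], pos)
            for pos, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, pos = min(candidates)  # earliest-learned merge has priority
        symbols[pos:pos + 2] = [symbols[pos] + symbols[pos + 1]]
    return symbols

# Hypothetical merge table, in the order the merges would have been learned.
MERGES = [("f", "i"), ("b", "r"), ("fi", "e"), ("br", "e"), ("fie", "bre")]

full = bpe_encode("fiebre", MERGES)                  # no dropout
chars = bpe_encode("fiebre", MERGES, dropout=1.0)    # every merge discarded
noisy = bpe_encode("fiebre", MERGES, dropout=0.1, rng=random.Random(0))
```

Without dropout the word is always segmented the same way; with p between 0 and 1 the same word can yield different subword sequences across epochs, while concatenating the subwords always recovers the original word.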

Sub-subword features
We used the Sub-subword feature method [18] in one of the systems to expose character-level information to the network. According to [19], the sub-subword features method helps regularize systems trained with little data. The method consists of building the embedding matrices from the n-gram features of the subwords in the vocabulary. The features used to produce the embeddings are selected by an algorithm before training, and the neural network that produces the embeddings is trained with the rest of the model.
Since we used a RoBERTa LLM to build the NER models, we did not want to discard its (sub-)word embeddings.
Before training the NER model with the sub-subword feature embeddings, we fit the feature-to-embedding (FTE) network to produce embeddings similar to those included with the RoBERTa model. We used 'Mean Squared Error' (MSE) as the training loss for this purpose. After this step, the NER model was trained normally (using CRF and BPE dropout).
This technique was originally proposed for neural machine translation (NMT), and our participation in the SympTEMIST task was the first time that it was used for NER. The FTE network had three layers, with 3072 units in the hidden layers, as used by [19].
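To make the notion of n-gram features of a subword concrete, the sketch below builds multi-hot feature vectors of the kind that an FTE network could take as input. It is only an illustration: the boundary markers, the maximum n-gram length, and the frequency-based selection criterion are our assumptions here and are not necessarily those of [18].

```python
from collections import Counter

def char_ngrams(subword, n_max=3):
    """Character n-grams of length 1..n_max, with '^' and '$' as boundary markers."""
    marked = "^" + subword + "$"
    return [marked[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(marked) - n + 1)]

def select_features(vocab, max_features):
    """Keep the n-grams found in the most subwords (assumed selection criterion)."""
    counts = Counter(g for w in vocab for g in set(char_ngrams(w)))
    # Descending count, then alphabetical order to keep the result deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:max_features]

def featurize(subword, features):
    """Multi-hot vector marking which selected n-grams occur in the subword."""
    present = set(char_ngrams(subword))
    return [1 if f in present else 0 for f in features]

# Hypothetical three-entry vocabulary.
vocab = ["dolor", "dolores", "fiebre"]
features = select_features(vocab, max_features=50)
vector = featurize("dolor", features)
```

Because 'dolor' and 'dolores' share most of their n-grams, their feature vectors overlap, which is how character-level information reaches the embedding of each subword.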

Previous experiments
In order to choose the best approach for our submissions, we performed some experiments using the provided training data. The data provided contained 750 documents. The documents were segmented into sentences using the Spanish-language NLTK 'punkt' model. We avoided splitting sentences when that would split a labelled entity. After sentence segmentation the dataset contained 12 009 sentences. From these sentences we made training, validation, and test sets containing 11 009, 500, and 500 sentences, respectively.
We used BIO encoding for the entities. In preliminary experiments we did not find any benefit in using S- or E-tags. We first tried using a SoftMax layer on top of an LLM. We tried different Spanish-language models available at 'HuggingFace', and the model by [20] gave the best results for us, with a 65.78% F1 score. Adding BPE dropout improved the F1 score to 72%.
We observed that our models were producing invalid transitions, such as outputting 'I-SINTOMA' labels without a preceding 'B-SINTOMA'. For this reason, we decided to try using a CRF on top of the LLM-based NER model, which improved the F1 score. Since our predictions were still producing invalid transitions, we initialized the CRF transition matrix to disallow O-to-I transitions. The introduction of this bias gave us the best results.
We also tried using the Sub-subword features approach described in the 'Techniques' section. This did not improve the F1 score for us.
We trained all the models for 25 epochs with batches of 15 sentences and a learning rate of 2e-5. The 'AdamW' optimizer was used, keeping the best model according to our validation data. The results of these preliminary experiments are summarized in the first column of Table 2. Unlike the other reported results, these results were computed on our custom test set, randomly partitioned from the training data.
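The rule of not splitting a labelled entity can be sketched as a post-processing step over sentence boundaries. The function below is a hypothetical illustration, not our preprocessing code; it assumes sentence and entity spans are given as character offsets with exclusive ends (such spans could come, for example, from a sentence tokenizer that reports offsets).

```python
def merge_split_sentences(sentence_spans, entity_spans):
    """Merge adjacent sentences whenever their boundary falls inside an entity.

    Both arguments are lists of (start, end) character offsets, end exclusive;
    sentence_spans must be in document order.
    """
    merged = []
    for start, end in sentence_spans:
        if merged:
            prev_start, prev_end = merged[-1]
            # An entity straddles the boundary if it begins before the previous
            # sentence ends and finishes after the current sentence starts.
            if any(e_start < prev_end and e_end > start
                   for e_start, e_end in entity_spans):
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged

# Made-up example: the abbreviation "abd." triggers a spurious sentence break
# inside the entity "Dolor abd. agudo".
text = "Dolor abd. agudo. Sin fiebre."
sentence_spans = [(0, 10), (11, 17), (18, 29)]
entity_spans = [(0, 16)]
fixed = merge_split_sentences(sentence_spans, entity_spans)
```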
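A minimal sketch of such a bias is shown below, together with a helper that counts invalid transitions in a predicted sequence. The penalty value and the function names are assumptions made for illustration; in a trainable CRF the initial values can still change during training.

```python
LABELS = ["O", "B-SINTOMA", "I-SINTOMA"]

def is_invalid(prev_label, label):
    """In BIO, an I- tag must follow a B- or I- tag of the same entity type."""
    if not label.startswith("I-"):
        return False
    entity_type = label[2:]
    return prev_label not in ("B-" + entity_type, "I-" + entity_type)

def biased_transitions(labels, penalty=-10000.0):
    """Initial transition matrix: zero everywhere except invalid transitions."""
    return [[penalty if is_invalid(prev, cur) else 0.0 for cur in labels]
            for prev in labels]

def count_invalid(sequence):
    """Number of invalid transitions in a list of predicted labels."""
    # The start of the sequence is treated as if it followed an "O" label.
    return sum(is_invalid(prev, cur)
               for prev, cur in zip(["O"] + sequence, sequence))

matrix = biased_transitions(LABELS)
n_bad = count_invalid(["O", "I-SINTOMA", "I-SINTOMA", "B-SINTOMA"])
```

With a single entity type, the only penalized cell is the one for the O-to-I transition.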
Since multiple submissions were allowed for each team, we submitted two systems corresponding to the CRF + BPEd + bias and CRF + BPEd + bias + SSWF configurations from Table 2, but trained on the whole training data for a fixed number of four epochs. We decided to run for four epochs because we observed from the preliminary experiments that, for these configurations, the best performing model was usually found at epoch 4 for all initialization seeds.
We reproduce the official results for our two submissions in Table 1, together with the results of the best-performing submission for strict evaluation (an ensemble model [21] from the ICB team) and for overlapping evaluation (an ensemble model [22] from the BIT.UA team). The scores P and R stand for precision and recall. The scores prefixed by 'o_' show their overlapping counterparts. We only considered the strict F1 score when optimizing our models. The best scores are highlighted in bold and the second-best are underlined.
Although our submissions were not among the best with respect to the F1 score, they did get the best recall scores for both strict and overlapping evaluation. On overlapping evaluation, our models had a better F1 score than the best model from strict evaluation, which was optimized for precision. The overlapping F1 of our models was close to that of the best performing model, from team BIT.UA.
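The difference between strict and overlapping scores can be illustrated with the sketch below. It is not the official evaluation code, whose matching rules may differ; it assumes that a strict match requires identical offsets and that an overlapping match requires at least one shared character. The example spans are made up.

```python
def prf(gold, pred, overlap=False):
    """Precision, recall and F1 over (start, end) spans, end exclusive."""
    def match(a, b):
        if overlap:
            return a[0] < b[1] and b[0] < a[1]  # spans share some character
        return a == b                            # identical offsets
    # Predicted spans matching some gold span count towards precision.
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    # Gold spans matched by some prediction count towards recall.
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up spans: one exact match, one partial match, one spurious prediction.
gold = [(0, 5), (10, 20)]
pred = [(0, 5), (12, 20), (30, 35)]
strict_scores = prf(gold, pred)
overlap_scores = prf(gold, pred, overlap=True)
```

In this example the partially matched span counts as an error under strict evaluation but as a hit under overlapping evaluation, so the overlapping scores are higher.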

New experiments
A new version of the SympTEMIST dataset was released after the completion of the challenge [2]. This new version included the held-out test set and normalized data. With this new data, we repeated the experiments, evaluating the results on the provided test data.
The hyperparameters for the new set of experiments were as described for the previous set. We used 1000 sentences from the training data for validation and chose the model with the best validation score. We trained this model for one extra epoch using a combination of the training and validation data. This is different from the four-epoch approach that we took for the challenge submissions.
We observed that the models roberta-base-biomedical-es (Model name: 'PlanTL-GOB-ES/roberta-base-biomedical-es') and bsc-bio-es (Model name: 'PlanTL-GOB-ES/bsc-bio-es') were used by other participants. We compare these two pretrained LLMs. We also experimented with adding NER training data from the CARMEN-I dataset.
We report the mean and standard deviation of the F1 score, as well as the minimum and maximum scores, for each model configuration. The results are summarized in Table 2.
The trend that we observed in the previous set of experiments is repeated: each of the added techniques improves the result except for the sub-subword features, which had a negative impact on F1 scores. We also see that the results that we obtained are different from the official results. We cannot explain this difference, but it may be related to different pre-/post-processing or to differences in the evaluation code.
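For clarity, the per-configuration summary can be computed as in the sketch below. The scores are invented for illustration and are not results from Table 2, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev

def summarize(f1_scores):
    """Mean, sample standard deviation, minimum and maximum of a list of F1 scores."""
    return {
        "mean": mean(f1_scores),
        "std": stdev(f1_scores),  # sample standard deviation (n - 1 denominator)
        "min": min(f1_scores),
        "max": max(f1_scores),
    }

# Invented F1 scores from three runs of one hypothetical configuration.
summary = summarize([0.70, 0.72, 0.71])
```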

Table 1. Results reported by the organizers
The roberta-base-biomedical-es (RBBE)-based models performed slightly but consistently better than the bsc-bio-es (BBE)-based ones.
Including the data from CARMEN-I generally improved the results, but just slightly in most cases. The reason for the lack of larger improvements may be the different nature of the texts in CARMEN-I.

Model ensembling experiments
The best performing model [21] in the official ranking was an ensemble of multiple models using a simple majority voting approach. We also tried this approach using the models from Table 2. Since we trained three runs for each model configuration, we tried combining the three of them for the CRF + BPEd + bias configuration (-sswf) and the CRF + BPEd + bias + SSWF configuration (+sswf).
We did this for both RBBE and BBE. These ensembles use three models. The results are displayed in Table 3. The (-sswf + sswf) row combines the models from the two previous rows, and thus the cells (RBBE, -sswf + sswf) and (BBE, -sswf + sswf) each ensemble six models. The RBBE + BBE column is the combination of the two previous columns, so its cells use 6, 6, and 12 models, respectively. We repeated this for the models using training data from the CARMEN-I corpus.
The ensemble models improve the results of their corresponding base models in all cases, even when all the models use the same configuration. However, we observe that the improvement is larger when the models are of different configurations.
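A simple majority vote over predicted entities can be sketched as follows. This is an illustration under our own assumptions rather than the exact procedure of [21] or of our experiments: votes are cast per exact (start, end, label) span, and a span is kept when more than half of the models predict it.

```python
from collections import Counter

def majority_vote(predictions):
    """Keep every entity span predicted by more than half of the models.

    predictions: one collection of (start, end, label) tuples per model.
    """
    # Each model contributes at most one vote per span.
    votes = Counter(span for model_spans in predictions for span in set(model_spans))
    threshold = len(predictions) / 2
    return sorted(span for span, n in votes.items() if n > threshold)

# Made-up predictions from three hypothetical models for one document.
model_a = [(0, 5, "SINTOMA"), (10, 20, "SINTOMA")]
model_b = [(0, 5, "SINTOMA")]
model_c = [(0, 5, "SINTOMA"), (10, 20, "SINTOMA"), (30, 35, "SINTOMA")]
ensemble = majority_vote([model_a, model_b, model_c])
```

With an even number of models, such as the six- and twelve-model cells, the strict majority threshold used here discards ties; other tie-breaking rules are possible.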

Conclusions
Our experiments confirmed the efficacy of well-established NER techniques. Our experimental SSWF technique did not perform as well as we had expected, but it did improve the results when combined with other models in an ensemble setting. Using the extra data from CARMEN-I generally improved the results, in spite of the format difference in the source text data.